-
Abstract Linking sequence-derived microbial taxa abundances to host (patho-)physiology or habitat characteristics in a reproducible and interpretable manner has remained a formidable challenge for the analysis of microbiome survey data. Here, we introduce a flexible probabilistic modeling framework, VI-MIDAS (variational inference for microbiome survey data analysis), that enables joint estimation of context-dependent drivers and broad patterns of associations of microbial taxon abundances from microbiome survey data. VI-MIDAS comprises mechanisms for direct coupling of taxon abundances with covariates and taxa-specific latent coupling, which can incorporate spatio-temporal information and taxon–taxon interactions. We leverage mean-field variational inference for posterior VI-MIDAS model parameter estimation and illustrate model building and analysis using Tara Ocean Expedition survey data. Using VI-MIDAS’ latent embedding model and tools from network analysis, we show that marine microbial communities can be broadly categorized into five modules, including SAR11-, Nitrosopumilus-, and Alteromonadales-dominated communities, each associated with specific environmental and spatiotemporal signatures. VI-MIDAS also finds evidence for largely positive taxon–taxon associations in SAR11 or Rhodospirillales clades, and negative associations with Alteromonadales and Flavobacteriales classes. Our results indicate that VI-MIDAS provides a powerful integrative statistical analysis framework for discovering broad patterns of associations between microbial taxa and context-specific covariate data from microbiome survey data.
-
Heterotrophic bacteria and archaea (“heteroprokaryotes”) drive global carbon cycling, but how to quantitatively organize their functional complexity remains unclear. We generated a global-scale understanding of marine heteroprokaryotic functional biogeography by synthesizing genetic sequencing data with a mechanistic marine ecosystem model. We incorporated heteroprokaryotic diversity into the trait-based model along two axes: substrate lability and growth strategy. Using genetic sequences along three ocean transects, we compiled 21 heteroprokaryotic guilds and estimated their degree of optimization for rapid growth (copiotrophy). Data and model consistency indicated that gradients in grazing and substrate lability predominantly set biogeographical patterns, and we identified deep-ocean “slow copiotrophs” whose ecological interactions control the surface accumulation of dissolved organic carbon.
Free, publicly accessible full text available May 22, 2026.
-
Abstract We introduce the Global rRNA Universal Metabarcoding Plankton database (GRUMP), which consists of 1194 samples that were collected from 2003–2020 and cover extensive latitudinal and longitudinal transects, as well as depth profiles in all major ocean basins. DNA from unfractionated (>0.2 µm) seawater samples was amplified using the 515Y/926R universal three-domain rRNA gene primers, simultaneously quantifying the relative abundance of amplicon sequencing variants (ASVs) from bacteria, archaea, eukaryotic nuclear 18S, and eukaryotic plastid 16S. Thus, the ratio between taxa in one sample is directly comparable to the ratio in any other GRUMP sample, regardless of gene copy number differences. This obviates a problem in prior global studies that used size-fractionation and different rRNA gene primers for bacteria, archaea, and eukaryotes, precluding comparisons across size fractions or domains. On average, bacteria contributed 71%, eukaryotes 19%, and archaea 8% to rRNA gene abundance, though eukaryotes contributed 32% at latitudes >40°. GRUMP is publicly available on the Simons Collaborative Marine Atlas Project (CMAP), promoting the global comparison of marine microbial dynamics.
-
Abstract Microbial ecological functions are an emergent property of community composition. For some ecological functions, this link is strong enough that community composition can be used to estimate the quantity of an ecological function. Here, we apply random forest regression models to compare the predictive performance of community composition and environmental data for bacterial production (BP). Using data from two independent long-term ecological research sites—Palmer LTER in Antarctica and Station SPOT in California—we found that community composition was a strong predictor of BP. The top performing model achieved an R² of 0.84 and RMSE of 20.2 pmol L⁻¹ hr⁻¹ on independent validation data, outperforming a model based solely on environmental data (R² = 0.32, RMSE = 51.4 pmol L⁻¹ hr⁻¹). We then operationalized our top performing model, estimating BP for 346 Antarctic samples from 2015 to 2020 for which only community composition data were available. Our predictions resolved spatial trends in BP with significance in the Antarctic (P value = 1 × 10⁻⁴) and highlighted important taxa for BP across ocean basins. Our results demonstrate a strong link between microbial community composition and microbial ecosystem function and begin to leverage long-term datasets to construct models of BP based on microbial community composition.
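The two scores quoted in this abstract, R² and RMSE, can be illustrated with a minimal sketch. It assumes R² means the coefficient of determination (1 minus residual over total sum of squares); the abstract does not say which definition the authors used. The function names and the numbers below are hypothetical and are not data from the study.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total (assumed definition)."""
    mean_observed = sum(observed) / len(observed)
    ss_residual = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_total = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_residual / ss_total


# Hypothetical bacterial production values (illustrative only).
observed_bp = [10.0, 20.0, 30.0, 40.0]
predicted_bp = [12.0, 18.0, 33.0, 39.0]

print(f"RMSE = {rmse(observed_bp, predicted_bp):.3f}")
print(f"R^2  = {r_squared(observed_bp, predicted_bp):.3f}")
```

RMSE is expressed in the units of the predicted quantity, which is why the abstract reports it in pmol L⁻¹ hr⁻¹, whereas R² is unitless.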
-
Abstract Sequence classification facilitates a fundamental understanding of the structure of microbial communities. Binary metagenomic sequence classifiers are insufficient because environmental metagenomes are typically derived from multiple sequence sources. Here we introduce a deep-learning based sequence classifier, DeepMicroClass, that classifies metagenomic contigs into five sequence classes, i.e. viruses infecting prokaryotic or eukaryotic hosts, eukaryotic or prokaryotic chromosomes, and prokaryotic plasmids. DeepMicroClass achieved high performance for all sequence classes at various tested sequence lengths ranging from 500 bp to 100 kbps. By benchmarking on a synthetic dataset with variable sequence class composition, we showed that DeepMicroClass obtained better performance for eukaryotic, plasmid and viral contig classification than other state-of-the-art predictors. DeepMicroClass achieved comparable performance on viral sequence classification with geNomad and VirSorter2 when benchmarked on the CAMI II marine dataset. Using a coastal daily time-series metagenomic dataset as a case study, we showed that microbial eukaryotes and prokaryotic viruses are integral to microbial communities. By analyzing monthly metagenomes collected at HOT and BATS, we found relatively higher viral read proportions in the subsurface layer in late summer, consistent with the seasonal viral infection patterns prevalent in these areas. We expect DeepMicroClass will promote metagenomic studies of under-appreciated sequence types.
-
Cyanophages exert important top-down controls on their cyanobacteria hosts; however, concurrent analysis of both phage and host populations is needed to better assess phage–host interaction models. We analyzed picocyanobacteria Prochlorococcus and Synechococcus and T4-like cyanophage communities in Pacific Ocean surface waters using five years of monthly viral and cellular fraction metagenomes. Cyanophage communities contained thousands of mostly low-abundance (<2% relative abundance) species with varying temporal dynamics, categorized as seasonally recurring or non-seasonal and occurring persistently, occasionally, or sporadically (detected in ≥85%, 15–85%, or <15% of samples, respectively). Viromes contained mostly seasonal and persistent phages (~40% each), while cellular fraction metagenomes had mostly sporadic species (~50%), reflecting that these sample sets capture different steps of the infection cycle—virions from prior infections or within currently infected cells, respectively. Two groups of seasonal phages correlated to Synechococcus or Prochlorococcus were abundant in spring/summer or fall/winter, respectively. Cyanophages likely have a strong influence on the host community structure, as their communities explained up to 32% of host community variation. These results indicate that both seasonally recurrent and apparently stochastic processes, likely determined by host availability and different host-range strategies among phages, are critical to phage–host interactions and dynamics, consistent with both the Kill-the-Winner and the Bank models.
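The occurrence categories defined in this abstract (detected in ≥85%, 15–85%, or <15% of samples) can be expressed as a small rule. The sketch below is illustrative only: the function name is hypothetical, the input is assumed to be one detected/not-detected flag per sample, and assigning the exact 15% boundary to "occasional" is an assumption, since the abstract does not say which side it falls on.

```python
def occurrence_category(detections):
    """Classify a phage by the fraction of samples in which it was detected.

    `detections` holds one boolean per sample (True = detected).
    Thresholds follow the abstract; the handling of the exact 15%
    boundary is an assumption.
    """
    if not detections:
        raise ValueError("at least one sample is required")
    fraction = sum(1 for d in detections if d) / len(detections)
    if fraction >= 0.85:
        return "persistent"
    if fraction >= 0.15:
        return "occasional"
    return "sporadic"


# Hypothetical detection records over 60 monthly samples (five years).
always_there = [True] * 54 + [False] * 6    # detected in 90% of samples
comes_and_goes = [True] * 30 + [False] * 30  # detected in 50% of samples
rarely_seen = [True] * 3 + [False] * 57      # detected in 5% of samples

for name, record in [("always_there", always_there),
                     ("comes_and_goes", comes_and_goes),
                     ("rarely_seen", rarely_seen)]:
    print(name, occurrence_category(record))
```

This covers only the persistence axis; the seasonal versus non-seasonal split described in the abstract would need a separate periodicity test, which is not sketched here.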
-
Abstract Free-living and particle-associated marine prokaryotes have physiological, genomic, and phylogenetic differences, yet factors influencing their temporal dynamics remain poorly constrained. In this study, we quantify the entire microbial community composition monthly over several years, including viruses, prokaryotes, phytoplankton, and total protists, from the San-Pedro Ocean Time-series using ribosomal RNA sequencing and viral metagenomics. Canonical analyses show that in addition to physicochemical factors, the double-stranded DNA viral community is the strongest factor predicting free-living prokaryotes, explaining 28% of variability, whereas the phytoplankton (via chloroplast 16S rRNA) community is strongest with particle-associated prokaryotes, explaining 31% of variability. Unexpectedly, protist community explains little variability. Our findings suggest that biotic interactions are significant determinants of the temporal dynamics of prokaryotes, and the relative importance of specific interactions varies depending on lifestyles. Also, warming influenced the prokaryotic community, which largely remained oligotrophic summer-like throughout 2014–15, with cyanobacterial populations shifting from cold-water ecotypes to warm-water ecotypes.
-
Abstract The introduction of high-throughput chromosome conformation capture (Hi-C) into metagenomics enables reconstructing high-quality metagenome-assembled genomes (MAGs) from microbial communities. Despite recent advances in recovering eukaryotic, bacterial, and archaeal genomes using Hi-C contact maps, few Hi-C-based methods are designed to retrieve viral genomes. Here we introduce ViralCC, a publicly available tool to recover complete viral genomes and detect virus-host pairs using Hi-C data. Compared to other Hi-C-based methods, ViralCC leverages the virus-host proximity structure as a complementary information source for the Hi-C interactions. Using mock and real metagenomic Hi-C datasets from several different microbial ecosystems, including the human gut, cow fecal, and wastewater, we demonstrate that ViralCC outperforms existing Hi-C-based binning methods as well as state-of-the-art tools specifically dedicated to metagenomic viral binning. ViralCC can also reveal the taxonomic structure of viruses and virus-host pairs in microbial communities. When applied to a real wastewater metagenomic Hi-C dataset, ViralCC constructs a phage-host network, which is further validated using CRISPR spacer analyses. ViralCC is an open-source pipeline available at https://github.com/dyxstat/ViralCC.
-
Bacteria are single-celled organisms that live out their lives at a microscopic scale. We can find bacteria everywhere we look for them, including inside of our own bodies. Bacteria are incredibly diverse and come in many shapes and sizes. They also vary widely in how they live and grow. Some bacteria grow very quickly and others grow slowly. We wanted to measure the growth of many different types of bacteria in the environment. Unfortunately, some species of bacteria are very difficult to grow in the laboratory. To get around this, we designed a method to predict how fast a type of bacteria can grow, just from its DNA. This way, if we have the DNA of a bacterial species, we can measure its growth even if we cannot get it to grow in our laboratory.
-
Abstract Motivation: Phage–host associations play important roles in microbial communities. But in natural communities, as opposed to culture-based lab studies, where phages are discovered and characterized metagenomically, their hosts are generally not known. Several programs have been developed for predicting which phage infects which host based on various sequence similarity measures or machine learning approaches. These are often based on whole viral and host genomes, but in metagenomics-based studies, we rarely have whole genomes but rather must rely on contigs that are sometimes as short as hundreds of bp long. Therefore, we need programs that predict hosts of phage contigs on the basis of these short contigs. Although most existing programs can be applied to metagenomic datasets for these predictions, their accuracies are generally low. Here, we develop ContigNet, a convolutional neural network-based model capable of predicting phage–host matches based on relatively short contigs, and compare it to previously published VirHostMatcher (VHM) and WIsH. Results: On the validation set, ContigNet achieves 72–85% area under the receiver operating characteristic curve (AUROC) scores, compared to the maximum of 68% by VHM or WIsH for contigs of lengths between 200 bps and 50 kbps. We also apply the model to the Metagenomic Gut Virus (MGV) catalogue, a dataset containing a wide range of draft genomes from metagenomic samples and achieve 60–70% AUROC scores compared to that of VHM and WIsH of 52%. Surprisingly, ContigNet can also be used to predict plasmid-host contig associations with high accuracy, indicating a similar genetic exchange between mobile genetic elements and their hosts. Availability and implementation: The source code of ContigNet and related datasets can be downloaded from https://github.com/tianqitang1/ContigNet.
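The AUROC metric reported in this abstract can be read as the probability that a randomly chosen true phage–host pair receives a higher score than a randomly chosen non-pair. The sketch below computes it by direct pairwise comparison; it is a generic illustration of the metric, not ContigNet code, and the function name, scores, and labels are hypothetical.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    `labels` uses 1 for a true pair and 0 for a non-pair.
    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical match scores for four candidate phage-host contig pairs.
example_scores = [0.9, 0.8, 0.4, 0.3]
example_labels = [1, 0, 1, 0]

print(f"AUROC = {auroc(example_scores, example_labels):.2f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the 52% to 85% range quoted in the abstract. The pairwise loop is quadratic in the number of examples; a rank-based formulation would be preferable for large benchmark sets.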
